Data Exploration and Visualisation
You have raw data, now what?
Subset, summarise, create, merge…
wrangling = manipulation = munging
Fly-by tour, focusing on common operations
View()head()# A tibble: 7 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
# ℹ 2 more variables: sex <fct>, year <int>
str()tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
$ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
$ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
$ bill_length_mm : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
$ bill_depth_mm : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
$ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
$ body_mass_g : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
$ sex : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
$ year : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
names()[1] "species" "island" "bill_length_mm"
[4] "bill_depth_mm" "flipper_length_mm" "body_mass_g"
[7] "sex" "year"
[1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710"
[4] "Hornet 4 Drive" "Hornet Sportabout" "Valiant"
[7] "Duster 360" "Merc 240D" "Merc 230"
[10] "Merc 280" "Merc 280C" "Merc 450SE"
[13] "Merc 450SL" "Merc 450SLC" "Cadillac Fleetwood"
[16] "Lincoln Continental" "Chrysler Imperial" "Fiat 128"
[19] "Honda Civic" "Toyota Corolla" "Toyota Corona"
[22] "Dodge Challenger" "AMC Javelin" "Camaro Z28"
[25] "Pontiac Firebird" "Fiat X1-9" "Porsche 914-2"
[28] "Lotus Europa" "Ford Pantera L" "Ferrari Dino"
[31] "Maserati Bora" "Volvo 142E"
colnames()dplyr::rename() [1] "miles_per_gallon" "cylinders" "disp" "hp"
[5] "drat" "wt" "qsec" "vs"
[9] "am" "gear" "carb"
Subsetting by index
Drop rows or columns with negative indices
Subset using logical vector
Subset using names (note different objects)
{dplyr} (1)# A tibble: 344 × 3
species island body_mass_g
<fct> <fct> <int>
1 Adelie Torgersen 3750
2 Adelie Torgersen 3800
3 Adelie Torgersen 3250
4 Adelie Torgersen NA
# ℹ 340 more rows
{dplyr} (2)penguins %>%
select(species, island, body_mass_g) %>%
filter(!(body_mass_g > 6000)) %>%
print(n = 4)# A tibble: 340 × 3
species island body_mass_g
<fct> <fct> <int>
1 Adelie Torgersen 3750
2 Adelie Torgersen 3800
3 Adelie Torgersen 3250
4 Adelie Torgersen 3450
# ℹ 336 more rows
# remove wt column
cars_renamed <- cars_renamed[ ,-which(names(cars_renamed) == "wt")]
head(cars_renamed, n = 3) miles_per_gallon cylinders displacement horse_power
Mazda RX4 21.0 6 160 110
Mazda RX4 Wag 21.0 6 160 110
Datsun 710 22.8 4 108 93
rear_axel_ratio qsec vs am gear carb weight
Mazda RX4 3.90 16.46 0 1 4 4 2.620
Mazda RX4 Wag 3.90 17.02 0 1 4 4 2.875
Datsun 710 3.85 18.61 1 1 4 1 2.320
dplyr::mutate()cars_renamed <- cars_renamed %>%
mutate(weight_kg = weight * 1000)
cars_renamed %>%
select(weight, weight_kg) %>%
head(n = 3) weight weight_kg
Mazda RX4 2.620 2620
Mazda RX4 Wag 2.875 2875
Datsun 710 2.320 2320
model mpg cyl disp hp drat wt qsec vs am gear carb
1 Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
2 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
3 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
# A tibble: 1 × 1
average_bill_length_mm
<dbl>
1 43.9
bill_length_mm_summary <- penguins %>%
summarise(
mean = mean(bill_length_mm, na.rm = TRUE),
median = median(bill_length_mm, na.rm = TRUE),
min = max(bill_length_mm, na.rm = TRUE),
q_0 = min(bill_length_mm, na.rm = TRUE),
q_1 = quantile(bill_length_mm, prob = 0.25, na.rm = TRUE),
q_2 = median(bill_length_mm, na.rm = TRUE),
q_3 = quantile(bill_length_mm, prob = 0.25, na.rm = TRUE),
q_4 = max(bill_length_mm, na.rm = TRUE))
bill_length_mm_summary# A tibble: 1 × 8
mean median min q_0 q_1 q_2 q_3 q_4
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 43.9 44.4 59.6 32.1 39.2 44.4 39.2 59.6
penguins %>%
group_by(species) %>%
summarise(
mean = mean(bill_length_mm, na.rm = TRUE),
median = median(bill_length_mm, na.rm = TRUE),
min = max(bill_length_mm, na.rm = TRUE),
q_0 = min(bill_length_mm, na.rm = TRUE),
q_1 = quantile(bill_length_mm, prob = 0.25, na.rm = TRUE),
q_2 = median(bill_length_mm, na.rm = TRUE),
q_3 = quantile(bill_length_mm, prob = 0.25, na.rm = TRUE),
q_4 = max(bill_length_mm, na.rm = TRUE))# A tibble: 3 × 9
species mean median min q_0 q_1 q_2 q_3 q_4
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Adelie 38.8 38.8 46 32.1 36.8 38.8 36.8 46
2 Chinstrap 48.8 49.6 58 40.9 46.3 49.6 46.3 58
3 Gentoo 47.5 47.3 59.6 40.9 45.3 47.3 45.3 59.6
penguin_summary_stats <- penguins %>%
group_by(species, island) %>%
summarise(
mean = mean(bill_length_mm, na.rm = TRUE),
median = median(bill_length_mm, na.rm = TRUE),
min = max(bill_length_mm, na.rm = TRUE),
q_0 = min(bill_length_mm, na.rm = TRUE),
q_1 = quantile(bill_length_mm, prob = 0.25, na.rm = TRUE),
q_2 = median(bill_length_mm, na.rm = TRUE),
q_3 = quantile(bill_length_mm, prob = 0.25, na.rm = TRUE),
q_4 = max(bill_length_mm, na.rm = TRUE))
penguin_summary_stats# A tibble: 5 × 10
# Groups: species [3]
species island mean median min q_0 q_1 q_2 q_3 q_4
<fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Adelie Biscoe 39.0 38.7 45.6 34.5 37.7 38.7 37.7 45.6
2 Adelie Dream 38.5 38.6 44.1 32.1 36.8 38.6 36.8 44.1
3 Adelie Torgersen 39.0 38.9 46 33.5 36.7 38.9 36.7 46
4 Chinstrap Dream 48.8 49.6 58 40.9 46.3 49.6 46.3 58
5 Gentoo Biscoe 47.5 47.3 59.6 40.9 45.3 47.3 45.3 59.6
By default each summarise() will undo one level of grouping.
Use an appropriate number of calls or ungroup().
# A tibble: 3 × 10
species island mean median min q_0 q_1 q_2 q_3 q_4
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Adelie NA 38.8 38.7 45.2 33.4 37.0 38.7 37.0 45.2
2 Chinstrap NA 48.8 49.6 58 40.9 46.3 49.6 46.3 58
3 Gentoo NA 47.5 47.3 59.6 40.9 45.3 47.3 45.3 59.6
# A tibble: 5 × 10
species island mean median min q_0 q_1 q_2 q_3 q_4
<fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Adelie Biscoe 39.0 38.7 45.6 34.5 37.7 38.7 37.7 45.6
2 Adelie Dream 38.5 38.6 44.1 32.1 36.8 38.6 36.8 44.1
3 Adelie Torgersen 39.0 38.9 46 33.5 36.7 38.9 36.7 46
4 Chinstrap Dream 48.8 49.6 58 40.9 46.3 49.6 46.3 58
5 Gentoo Biscoe 47.5 47.3 59.6 40.9 45.3 47.3 45.3 59.6
Useful {forcats} functions:
fct_reorder(): Reordering a factor by another variable.fct_infreq(): Reordering a factor by the frequency of values.fct_relevel(): Changing the order of a factor by hand.fct_lump(): Collapsing the least/most frequent values of a factor into “other”.Check out the forcats vignette or the factors chapter of R4DS.
Working with text data is it’s own skill.
Requires some knowledge of regular expressions.
{stringr} simplifies this, somewhat.
Learn as you need it: strings section of R4DS.
Recall ISO standard and proceed with caution
\[ \text{YYYY} - \text{MM} - \text{DD}\]
{lubridate} makes this easier, see dates and times examples from R4DS.
Data on the same observational units stored across two or more tables.
Relational data chapter of R4DS.
Inner Join
{dplyr}, translate with {dbplyr}Learned how to wrangle tabular data in R with base R and {dplyr}
Met the idea of relational data and {dplyr}’s relationship to SQL
Become aware of some tricky data types and packages that can help.
Effective Data Science: EDAV - Wrangling - Zak Varty